# Data Description

Our analyses use two primary datasets.

* **`df_health_risk.csv`**
  - This dataset is derived from the work of [Obermeyer et al. (2019)](https://www.science.org/doi/10.1126/science.aax2342?ijkey=EUVkYeQaORypo&keytype=ref&siteid=sci).
  - We generate `df_health_risk.csv` by applying the `preprocess` function in `preprocess.py` to `data_new.csv` from their dataset.
  - The original dataset is available in their [GitLab repository](https://gitlab.com/labsysmed/dissecting-bias/-/tree/master/data?ref_type=heads).
  - It contains 48,784 entries, including 5,582 entries for Black individuals and 43,202 entries for White individuals.

* **`diabetes_fairlearn.csv`**
  - This dataset originates from [Fairlearn](https://fairlearn.org/main/user_guide/datasets/diabetes_hospital_data.html).
  - The preprocessing steps applied to the original data are detailed in `diabetes_fairlearn.ipynb`.
  - After preprocessing, the dataset comprises 95,309 entries, with 51,417 entries for women and 43,892 entries for men.
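
The row counts listed above can be checked after regenerating the files. The following is a minimal sketch, not part of the repository: the helper names `group_counts` and `check_counts` are hypothetical, and the name of the column holding the group label must be supplied by the caller, since the column names in the processed files are not stated here.

```python
import csv
from collections import Counter


def group_counts(csv_path, column):
    """Return (total_rows, Counter of the values found in `column`)."""
    counts = Counter()
    with open(csv_path, newline="") as f:
        # DictReader uses the first row as the header, so it is not counted.
        for row in csv.DictReader(f):
            counts[row[column]] += 1
    return sum(counts.values()), counts


def check_counts(csv_path, column, expected_total):
    """Raise ValueError if the file does not have `expected_total` rows.

    Returns the per-group counts so they can be compared with the
    figures in this description.
    """
    total, counts = group_counts(csv_path, column)
    if total != expected_total:
        raise ValueError(
            f"{csv_path}: expected {expected_total} rows, found {total}"
        )
    return counts
```

For example, `check_counts("df_health_risk.csv", "race", 48784)` should return per-group counts of 5,582 and 43,202 if the group label is stored in a column named `race`; that column name, like `gender` for `diabetes_fairlearn.csv`, is an assumption and may differ in the processed files.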
